Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions

Authors

  • Simon Alexanderson
  • Jonas Beskow
Abstract

In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences, under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance as the original talker and used an error minimization procedure to drive the animated version of the talker in a way that matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three different production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV-incongruence. In a separate evaluation, we gathered subjective opinions on the different animations, and found that some degree of incongruence was generally accepted. © 2013 Elsevier Ltd. All rights reserved.
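The abstract mentions an error minimization procedure that drives the animated talker to match the captured performance. The paper's exact formulation is not given here, but a common approach to this kind of retargeting is a least-squares fit of blendshape weights to the motion-capture marker positions; the sketch below is a hypothetical illustration of that idea (the function, marker layout, and blendshape basis are all assumptions, not the authors' method):

```python
import numpy as np

def fit_blendshape_weights(markers, neutral, blendshape_deltas):
    """Least-squares fit of blendshape weights so the animated face
    approximates the captured marker positions for one frame.

    markers           : (3M,) captured marker coordinates
    neutral           : (3M,) marker coordinates of the neutral face
    blendshape_deltas : (3M, K) per-blendshape displacement basis B

    Solves min_w || neutral + B w - markers ||^2, then clamps the
    weights to [0, 1] as a crude stand-in for a constrained solver.
    """
    residual = markers - neutral
    w, *_ = np.linalg.lstsq(blendshape_deltas, residual, rcond=None)
    return np.clip(w, 0.0, 1.0)

# Toy example: 2 markers (6 coordinates), 2 blendshapes that each
# displace one marker along the x-axis.
neutral = np.zeros(6)
B = np.array([[1, 0], [0, 0], [0, 0],
              [0, 1], [0, 0], [0, 0]], dtype=float)
captured = np.array([0.5, 0.0, 0.0, 0.25, 0.0, 0.0])
print(fit_blendshape_weights(captured, neutral, B))  # ~[0.5, 0.25]
```

In practice such a fit is run per frame over the full marker set, often with regularization or true box constraints (e.g. a non-negative least-squares solver) rather than post-hoc clipping.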


Related articles

Perceptual processing of audiovisual Lombard speech

Seeing the talker improves the intelligibility of speech degraded by noise (a visual speech enhancement effect). This experiment examined whether this enhancement was greater when the speech signals were recorded in noise than when they were recorded in quiet. Ten sentences were spoken by four people either in quiet (in-quiet) or whilst they were listening to cocktail party noise (in-noise). The vis...


SynFace - Speech-Driven Facial Animation for Virtual Speech-Reading Support

This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Seco...


The Listening Talker: An interdisciplinary workshop on natural and synthetic modification of speech in response to listening conditions

From the workshop abstracts: Can anybody read me? Motion capture recordings for an adaptable visual speech synthesizer (Simon Alexanderson & Jonas Beskow, p. 52); MAGE: A platform for performative speech synthesis, a new approach in exploring applications beyond text-to-speech (Maria Astrinaki, Nicolas d'Alessandro & Thierry Dutoit, p. 53); Overlap behavi...


Studies of audiovisual speech perception using production-based animation

This paper will summarize our work at Queen's University and ATR Laboratories on cross-modal speech perception and production. Our approach has been to study these two sides of speech together and to use the multi-modal speech production data to parameterize and control audiovisual animation systems. Two approaches to production-based facial animation have been pursued — one statistical and the...


Speech Driven MPEG-4 Facial Animation for Turkish

In this study, a system that generates visual speech by synthesizing 3D face points has been implemented. The synthesized face points drive MPEG-4 facial animation. To produce realistic and natural speech animation, a codebook-based technique, which is trained with audio-visual data from a speaker, was employed. An audio-visual speech database was created using a 3D facial motion capture syst...



Journal:
  • Computer Speech & Language

Volume 28, Issue:

Pages: -

Publication date: 2014